ABSTRACT

This paper addresses the problem of related entity finding (REF),
which was proposed at TREC 2009. The overall aim of REF is to
perform entity-related search on Web data, addressing common
information needs that are not well modeled as ad hoc document
search. We propose a novel framework for related entity finding in
a Web collection based on a probabilistic model. The model consists
of two parts: the probability indicating the relation between the
source entity and a candidate entity, and the probability indicating
the relevance of the candidate entity to the topic. Experimental
evaluations on the ClueWeb09 dataset show the effectiveness of our
REF framework.

1.  INTRODUCTION 

With the rapid development of the World Wide Web, search engines
have become an important tool for users to find the information they
need. Traditional information retrieval systems return a list of
documents. Often, however, users are searching for specific entities
rather than documents of any kind. For instance, when users submit
the query "Michael's teammates while he was racing in Formula 1",
they expect to obtain the names of Michael's teammates. With related
entity finding (REF), users can obtain the target entities directly,
without having to explore a large number of documents.

TREC 2009 highlighted the interest in related entity finding. The
TREC Entity task addresses common information needs that are not
well modeled as ad hoc document search. The overall aim of the task
is to perform entity-related search on Web data; 31 queries were
built for it (11 for training and 20 for testing). As an example of
the related entity finding task, given the source entity "Michael
Schumacher", the aim is to find all target entities related to it,
where the relation is described by the narrative "Michael's
teammates while he was racing in Formula 1".

In this paper, we propose a novel framework for the related entity
finding task based on a probabilistic model. Specifically, all
candidate entities are ranked by the probability $P(e \mid Q)$,
where $e$ denotes a candidate entity, $Q$ denotes the query, which
can be represented by a triple $(e^s, R, T)$, and $P(e \mid Q)$
denotes the probability of $Q$ generating $e$. $P(e \mid Q)$ is
approximated, up to a normalizing factor, by the product of two
probabilities. One is the probability $P(R \mid e, e^s)$, indicating
the relation between the source entity and the candidate entity. The
other is the probability $P(e \mid e^s, T)$, indicating the
relevance of the candidate entity to the topic. In the triple
$(e^s, R, T)$, $e^s$ denotes the source entity ("Michael Schumacher"
in the previous example), $R$ denotes the relation words
("teammates" in the previous example), and $T$ denotes the type of
the target entities ("Person" in the previous example). Experimental
evaluations on the ClueWeb09 dataset show the effectiveness of our
REF framework based on the probabilistic model.

2.  REF  Framework 

2.1  Preliminary 

The related entity finding task is defined as follows: given an
input entity (specified by its name and homepage), the type of the
target entity, and the nature of their relation (described in free
text), find related entities that are of the target type and stand
in the required relation to the input entity. An example information
need, "find organizations that are sponsoring Kimi Räikkönen", is
formulated as follows:

<query>
<num>1</num>
<entity_name>Kimi Räikkönen</entity_name>
<entity_URL>clueweb09-en0000-00-12345</entity_URL>
<target_entity>organization</target_entity>
<narrative>I'd like to know which organizations are Kimi's sponsors.</narrative>
</query>
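For illustration, a topic in this format can be read with Python's standard xml.etree.ElementTree module. The following is a minimal sketch, not the code used in our system: the names Topic and parse_topic are hypothetical, and the sketch assumes that each topic is well-formed XML with exactly the fields shown above.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Topic(NamedTuple):
    """Hypothetical container for one REF topic."""
    num: str
    entity_name: str   # name of the source entity
    entity_url: str    # ClueWeb09 document ID of the entity homepage
    target_type: str   # type T of the target entities
    narrative: str     # free-text description of the relation


def parse_topic(xml_text: str) -> Topic:
    """Parse a single <query> element into a Topic."""
    root = ET.fromstring(xml_text.strip())

    def field(tag: str) -> str:
        # findtext returns None when the tag is missing; fall back to "".
        return (root.findtext(tag) or "").strip()

    return Topic(
        num=field("num"),
        entity_name=field("entity_name"),
        entity_url=field("entity_URL"),
        target_type=field("target_entity"),
        narrative=field("narrative"),
    )


EXAMPLE_TOPIC = """
<query>
<num>1</num>
<entity_name>Kimi Räikkönen</entity_name>
<entity_URL>clueweb09-en0000-00-12345</entity_URL>
<target_entity>organization</target_entity>
<narrative>I'd like to know which organizations are Kimi's sponsors.</narrative>
</query>
"""

topic = parse_topic(EXAMPLE_TOPIC)
print(topic.entity_name, "|", topic.target_type)
```

The parsed fields correspond to the source entity (entity_name, entity_URL), the target type (target_entity), and the relation (narrative).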

2.2  REF  Framework

In  this  section,  we  describe  our  REF  framework  based  on  a 
probabilistic  model  in  detail.  Our  REF  framework  includes  three 
steps: 

(a) Step 1: We first build a candidate document set for the topic
and extract all candidate strings. Specifically, we search the
corpus with the narrative of the topic using BM25 [2] and take the
top 5000 documents as the candidate document set. Then, all
anchor texts in the candidate documents and the titles of the
candidate documents are extracted as the candidate strings.

(b) Step 2: We build a description document for each candidate
string and obtain all candidate entities for the topic.
Specifically, for each candidate string obtained in Step 1, we
extract from the candidate document set the top 100 sentences
containing the candidate string; the sentences are selected with
a summarization method [4]. The selected sentences are assembled
into the description document for the candidate string. Candidate
strings that occur in no more than five sentences are discarded.
Finally, we apply Stanford NER [1] to the description documents
and obtain the candidate entities by filtering the candidate
strings according to their Target Type Support (TTS), defined as

\[
\mathrm{TTS} = \frac{\text{number of occurrences of the string tagged with the target type}}{\text{number of occurrences of the string}}
\]

(c) Step 3: We rank all the candidate entities with our
probabilistic model and output the results. For each ranked
candidate entity, we output its corresponding anchor link (or
the document URL, for candidates extracted from titles).
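The TTS filter of Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the input format (one (string, tag) pair per NER-tagged occurrence), the function names, and the threshold min_tts are hypothetical, since the text above does not state a threshold value.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One NER-tagged occurrence: (candidate string, tag assigned to this occurrence).
Mention = Tuple[str, str]


def target_type_support(mentions: List[Mention], target_type: str) -> Dict[str, float]:
    """TTS per string: occurrences tagged with the target type / all occurrences."""
    total: Counter = Counter()
    matched: Counter = Counter()
    for string, tag in mentions:
        total[string] += 1
        if tag == target_type:
            matched[string] += 1
    return {s: matched[s] / total[s] for s in total}


def filter_candidates(mentions: List[Mention], target_type: str,
                      min_tts: float = 0.5) -> List[str]:
    """Keep strings whose TTS reaches min_tts (an assumed, illustrative threshold)."""
    tts = target_type_support(mentions, target_type)
    return sorted(s for s, value in tts.items() if value >= min_tts)


# Toy input: tags as they might come out of an NER tagger.
MENTIONS = [
    ("Ferrari", "ORGANIZATION"),
    ("Ferrari", "ORGANIZATION"),
    ("Ferrari", "PERSON"),
    ("Monaco", "LOCATION"),
    ("Monaco", "ORGANIZATION"),
    ("Monaco", "LOCATION"),
]

print(filter_candidates(MENTIONS, "ORGANIZATION"))
```

In this toy input, "Ferrari" is tagged as an organization in two of three occurrences and "Monaco" in one of three, so only "Ferrari" passes the assumed threshold of 0.5.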

The above steps make up our REF framework. Its key part is the
probabilistic model used for ranking in Step 3, which we introduce
in detail in the following section.

2.3  Probabilistic  Model

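In this section, we describe our probabilistic model. In our REF
framework, all candidate entities are ranked by the probability
$P(e \mid Q)$:

\begin{equation}
\begin{aligned}
P(e \mid Q) &\propto P(e, Q) \\
&= P(e, e^s, T, R) \\
&= P(R \mid e, e^s, T)\, P(e, e^s, T) \\
&= P(R \mid e, e^s, T)\, P(e \mid e^s, T)\, P(e^s, T) \\
&\propto P(R \mid e, e^s, T)\, P(e \mid e^s, T) \\
&\approx P(R \mid e, e^s)\, P(e \mid e^s, T)
\end{aligned}
\tag{1}
\end{equation}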

From Equation (1), we can see that the probabilistic model consists
of two parts. One is the probability $P(R \mid e, e^s)$, which
reflects the relation between the source entity $e^s$ and the
candidate entity $e$ and can be computed by counting the
co-occurrences of $e^s$, $e$ and $R$ in the candidate document set.
Note that in Equation (1) we approximate $P(R \mid e, e^s, T)$ with
$P(R \mid e, e^s)$ under the assumption that the semantics of the
target type $T$ can be deduced from the semantics of the source
entity $e^s$ and the relation $R$.
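One simple way to turn such co-occurrence counts into an estimate is sketched below. It is an assumption made for illustration, not necessarily the estimator used in our experiments: among the candidate documents that mention both entities, it takes the fraction that also mention at least one relation word. The function name, the case-insensitive substring matching, and the toy documents are all hypothetical.

```python
from typing import Iterable, List


def relation_probability(docs: List[str], source: str, candidate: str,
                         relation_words: Iterable[str]) -> float:
    """Estimate P(R | e, e^s) as count(e^s, e, R) / count(e^s, e) over documents."""
    source, candidate = source.lower(), candidate.lower()
    words = [w.lower() for w in relation_words]
    pair_count = 0     # documents mentioning both the source and the candidate
    triple_count = 0   # ... that also mention at least one relation word
    for doc in docs:
        text = doc.lower()
        if source in text and candidate in text:
            pair_count += 1
            if any(w in text for w in words):
                triple_count += 1
    # Without any co-occurrence of the two entities there is no evidence.
    return triple_count / pair_count if pair_count else 0.0


# Toy candidate documents (illustrative only).
DOCS = [
    "Rubens Barrichello was Michael Schumacher's teammate at Ferrari.",
    "Michael Schumacher and Rubens Barrichello finished first and second.",
    "Michael Schumacher won the championship.",
]

p = relation_probability(DOCS, "Michael Schumacher", "Rubens Barrichello",
                         ["teammate", "teammates"])
print(p)
```

Here two documents mention both entities and one of them also contains a relation word, so the estimate is 0.5.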

The other is the probability $P(e \mid e^s, T)$, which reflects the
relevance of the candidate entity to the topic. We estimate
$P(e \mid e^s, T)$ in three steps:

(a) Step 1: We build a description document for the narrative of
each topic. Specifically, for each narrative, we extract 100
sentences from the top 100 documents of the candidate document
set, one sentence per document, selected with a summarization
method [4]. The 100 sentences are combined into the description
document for the topic.

(b) Step 2: We compute the cosine similarity between the
description document of each candidate entity and the description
document of the topic.

(c) Step 3: We compute the probability $P(e \mid e^s, T)$ for each
candidate entity. Specifically, we approximate it by the cosine
similarity between the description document of the candidate
entity and the description document of the topic, normalized by
the sum of the cosine similarities over all candidate entities.
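Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not our actual implementation: it assumes raw term-frequency vectors and whitespace tokenization, neither of which is specified above, and all function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def term_vector(text: str) -> Counter:
    """Bag-of-words term frequencies (assumed weighting and tokenization)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance_probabilities(entity_docs: Dict[str, str],
                            topic_doc: str) -> Dict[str, float]:
    """Approximate P(e | e^s, T) by cosine similarities normalized to sum to 1."""
    topic_vec = term_vector(topic_doc)
    sims = {e: cosine(term_vector(d), topic_vec) for e, d in entity_docs.items()}
    total = sum(sims.values())
    if total == 0:
        # No candidate shares a term with the topic description.
        return {e: 0.0 for e in sims}
    return {e: s / total for e, s in sims.items()}


# Toy description documents (illustrative only).
ENTITY_DOCS = {
    "A": "formula one racing team sponsor",
    "B": "formula one race calendar",
    "C": "pasta cooking recipes",
}
TOPIC_DOC = "formula one team sponsor"

print(relevance_probabilities(ENTITY_DOCS, TOPIC_DOC))
```

Because of the normalization, the values over all candidate entities sum to one, and a candidate whose description document shares no term with the topic description receives probability zero.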

This approximation is reasonable: previous work has found that
snippets are good descriptions of queries [3], so the cosine
similarity between the description document of a candidate entity
and the description document of the topic reflects the relevance of
the candidate entity to the topic.
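Given estimates of the two factors in Equation (1), the final ranking is a sort by their product. The sketch below is illustrative only: the function name, the input dictionaries, and the example values are hypothetical, and a candidate missing from either dictionary is assumed to have probability zero.

```python
from typing import Dict, List, Tuple


def rank_candidates(relation_prob: Dict[str, float],
                    relevance_prob: Dict[str, float]) -> List[Tuple[str, float]]:
    """Rank candidates by P(R | e, e^s) * P(e | e^s, T), highest score first."""
    candidates = relation_prob.keys() | relevance_prob.keys()
    scores = {
        e: relation_prob.get(e, 0.0) * relevance_prob.get(e, 0.0)
        for e in candidates
    }
    # Sort by descending score; break ties by name for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up probabilities for three candidate entities.
RELATION = {"A": 0.5, "B": 0.2, "C": 0.9}
RELEVANCE = {"A": 0.6, "B": 0.3, "C": 0.1}

for entity, score in rank_candidates(RELATION, RELEVANCE):
    print(entity, round(score, 2))
```

Candidate C has the strongest relation evidence in this example but low relevance to the topic, so it ranks below candidate A; the product rewards candidates that do well on both factors.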

3.  Conclusion  and  Future  work 

This paper addresses the problem of REF, which aims to perform
entity-related search on Web data and to address common information
needs that are not well modeled as ad hoc document search. We
proposed a novel framework for finding related entities in a Web
collection based on a probabilistic model. The model consists of two
parts: the probability indicating the relation between the source
entity and a candidate entity, and the probability indicating the
relevance of the candidate entity to the topic. Experiments on the
ClueWeb09 dataset demonstrated the effectiveness of our method: its
average P@10 and NDCG are 0.2350 and 0.2103, respectively. However,
much work remains to be done. In the future, we will investigate the
generation of candidate entities, among other directions.